MICHAEL A. STEPHENS

Simon Fraser University, Burnaby, B. C., Canada V5A 1S6


Abstract 

In this article, two important methods of testing fit to a distribution are discussed and compared.
They are the family of tests based on the empirical distribution function of a random sample,
and the family based on plotting the order statistics against a suitable set of constants and
examining the fit of a line through the plotted points. The two sets will be called EDF tests
and Regression tests respectively.

Key Words: Correlation tests; EDF tests; Probability plot; Regression tests; Tests for exponentiality; Tests for normality.


1 The goodness-of-fit problem

Suppose a random sample of n values $x_1, x_2, x_3, \ldots, x_n$ is given, and it is desired to test that the sample comes from the distribution $F(x; \theta)$. The parameter $\theta$ represents a vector of parameters in the distribution; they may all be known, so that the tested distribution is completely specified (this situation will be called Case 0), or some or all of the parameters may have to be estimated from the sample. Thus a test might be required of the hypothesis that the sample comes from a normal distribution with mean $\mu$ and variance $\sigma^2$, or that a sample comes from a Gamma distribution with scale parameter $\beta$ and shape parameter $m$. For the present, we assume the distribution $F(x; \theta)$ is continuous. We shall sometimes write the distribution as $F(x)$ for brevity.
2 EDF tests

The empirical distribution function (EDF) of the sample is defined as follows:

$$F_n(x) = \frac{\text{number of observations} \le x}{n}, \qquad -\infty < x < \infty.$$

The EDF thus represents, for any value of x, the fraction of the observations less than or equal to x; it clearly parallels $F(x)$, which gives the probability that an observation is less than or equal to x. In fact, by the Glivenko-Cantelli lemma, $\sup_x |F_n(x) - F(x)| \to 0$ (with probability one) as $n \to \infty$. In 1933 Kolmogorov proposed a test based on the discrepancy $z_n(x) = F_n(x) - F(x)$, and Smirnov followed by proposing two related tests. The Kolmogorov-Smirnov tests, as they have come to be called, are defined as:

$$D^+ = \sup_x \{z_n(x)\}; \qquad D^- = \sup_x \{-z_n(x)\}.$$

The statistic actually introduced by Kolmogorov was $D = \max(D^+, D^-)$. At about the same time, Cramér and von Mises were considering tests based on the integral of the squared discrepancy $\{z_n(x)\}^2$. The Cramér-von Mises family of statistics is

$$Q = n \int_{-\infty}^{\infty} \{z_n(x)\}^2 \psi(x)\, dF(x),$$

where $\psi(x)$ is a weight function which can be used to vary the importance of different parts of the x-axis. Two commonly used weight functions are $\psi(x) = 1$, giving the Cramér-von Mises statistic $W^2$, and $\psi(x) = [F(x)\{1 - F(x)\}]^{-1}$, giving the Anderson-Darling statistic $A^2$. In addition, $W^2$ can be modified to yield Watson's statistic $U^2$, given by

$$U^2 = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) - \int_{-\infty}^{\infty} \{F_n(y) - F(y)\}\, dF(y) \right]^2 dF(x).$$

2.1  Computing  formulas 

The definitions of these statistics look rather difficult to handle, but in fact very easy computing formulas exist. They are derived by means of the Probability Integral Transformation (PIT). This is the transformation

$$z = F(x; \theta).$$

It is well known that, when $F(x; \theta)$ is the true distribution of x, this transformation gives a variable z which is uniformly distributed between 0 and 1, written U(0, 1). If the Kolmogorov-Smirnov and Cramér-von Mises statistics are now calculated from the EDF of the z-values, with $F(z) = z$, the uniform distribution, it may easily be shown that the values are the same as those calculated from the EDF of the original x-values. Working with the z-values then gives the following computing formulas, with $z_{(i)} = F(x_{(i)}; \theta)$, where $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ are the order statistics:


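$$
\begin{aligned}
D^+ &= \max_{1 \le i \le n} \{ i/n - z_{(i)} \}; \\
D^- &= \max_{1 \le i \le n} \{ z_{(i)} - (i-1)/n \}; \\
D &= \max(D^+, D^-); \\
W^2 &= \sum_{i=1}^{n} \left\{ z_{(i)} - \frac{2i-1}{2n} \right\}^2 + \frac{1}{12n}; \\
U^2 &= W^2 - n(\bar z - 0.5)^2, \quad \text{where } \bar z = \sum_{i=1}^{n} z_{(i)}/n; \\
A^2 &= -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1) \left[ \log z_{(i)} + \log\{1 - z_{(n+1-i)}\} \right].
\end{aligned}
\tag{1}
$$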
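Formulas (1) translate directly into a few lines of code. The following Python sketch assumes the z-values have already been obtained from the PIT and lie strictly between 0 and 1 (so that the logarithms in $A^2$ are defined); the function name `edf_statistics` is our own and does not come from any library.

```python
import math

def edf_statistics(z):
    """EDF statistics from PIT values z_i = F(x_i; theta), using formulas (1)."""
    zs = sorted(z)                      # order statistics z_(1) <= ... <= z_(n)
    n = len(zs)
    if n == 0 or not all(0.0 < v < 1.0 for v in zs):
        raise ValueError("need at least one z-value strictly between 0 and 1")
    # Python lists start at 0, so z_(i) is zs[i - 1] and z_(n+1-i) is zs[n - i].
    d_plus = max(i / n - zs[i - 1] for i in range(1, n + 1))
    d_minus = max(zs[i - 1] - (i - 1) / n for i in range(1, n + 1))
    w2 = sum((zs[i - 1] - (2 * i - 1) / (2 * n)) ** 2 for i in range(1, n + 1)) + 1 / (12 * n)
    zbar = sum(zs) / n
    u2 = w2 - n * (zbar - 0.5) ** 2
    a2 = -n - sum(
        (2 * i - 1) * (math.log(zs[i - 1]) + math.log(1.0 - zs[n - i]))
        for i in range(1, n + 1)
    ) / n
    return {"D+": d_plus, "D-": d_minus, "D": max(d_plus, d_minus),
            "W2": w2, "U2": u2, "A2": a2}

stats = edf_statistics([0.25, 0.75])   # D+ = D- = 0.25, W2 = 1/24
```

Sorting inside the function means the z-values may be passed in any order, matching the remark that the statistics depend only on the ordered $z_{(i)}$.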


2.2 Estimated parameters

Suppose one or more components of $\theta$ are unknown, but are estimated by an efficient method from the sample values. These estimates are then inserted where necessary in the PIT above, and the statistics are calculated from the resulting z-values using formulas (1). The z-values are now no longer uniformly distributed; we describe them as superuniform, because they almost always give much smaller values for the statistics, implying that the z-values are more evenly spaced than a genuine uniform sample.
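The superuniformity described above is easy to see by simulation. The sketch below is our own illustration, not a procedure from the text: it draws normal samples and computes $W^2$ twice, once with the true parameters (Case 0) and once with $\mu$ and $\sigma$ replaced by the sample mean and the sample standard deviation s. The sample size, number of replications and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean, stdev

def w2_statistic(z):
    """Cramer-von Mises W^2 from formula (1)."""
    zs = sorted(z)
    n = len(zs)
    return sum((zs[i - 1] - (2 * i - 1) / (2 * n)) ** 2 for i in range(1, n + 1)) + 1 / (12 * n)

def mean_w2(n=20, reps=2000, estimate=False, seed=1):
    """Average W^2 over simulated N(0, 1) samples, with true or estimated parameters in the PIT."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Case 0 uses the true parameters; otherwise plug in xbar and s.
        dist = NormalDist(fmean(x), stdev(x)) if estimate else NormalDist(0.0, 1.0)
        total += w2_statistic([dist.cdf(v) for v in x])
    return total / reps

print(mean_w2(estimate=False), mean_w2(estimate=True))
```

Under Case 0 the mean of $W^2$ is exactly 1/6 for every n, so the first average should be close to that; with estimated parameters the average is markedly smaller, which is why Case 0 tables cannot be used when parameters are estimated.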


2.3 Another transformation to uniformity

It is well known that if events are occurring randomly in time, say at times $t_{(1)} < t_{(2)} < \cdots < t_{(n)}$ (the clock is started at time zero), and if the values are transformed by $z_{(i)} = t_{(i)}/t_{(n)}$, the set of $n - 1$ values $z_{(i)}$, $i = 1, 2, \ldots, n - 1$, will be distributed as an ordered sample from U(0, 1). An interesting set of events which gives superuniform $z_{(i)}$ is the ends of reigns (deaths or abdications) of the Kings and Queens of England, starting with time zero as the accession of William I in 1066. It is hard to explain this phenomenon, even though it is obvious that successive reigns have lengths which are correlated: see Pearson [1].
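The transformation $z_{(i)} = t_{(i)}/t_{(n)}$ is equally simple to apply. In this sketch the event times are simulated from a Poisson process (exponential gaps between events), an assumption made only so that the example is self-contained; with real data, such as the reign-end dates above, the recorded times would be used instead. The resulting z-values can then be passed to formulas (1). The function name `times_to_uniform` is hypothetical.

```python
import random

def times_to_uniform(times):
    """Map event times t_(1) < ... < t_(n) to z_(i) = t_(i)/t_(n), i = 1, ..., n-1."""
    t = sorted(times)
    if len(t) < 2 or t[0] <= 0:
        raise ValueError("need at least two positive event times")
    return [ti / t[-1] for ti in t[:-1]]   # the last time only sets the scale

# Simulated Poisson process: exponential gaps between successive events.
rng = random.Random(42)
clock, events = 0.0, []
for _ in range(50):
    clock += rng.expovariate(1.0)
    events.append(clock)
z = times_to_uniform(events)
```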


2.4 Distribution theory

When the continuous distribution tested is completely specified (this is called Case 0), so that the test of fit becomes a test that the z-values are uniformly distributed, percentage points of the EDF statistics are either known exactly or can be approximated very accurately. Details and tables are given by Stephens [2]. Furthermore, it is possible to modify the statistics so that only the asymptotic points need be tabulated. To do this, a modified form $T^*$ of the EDF statistic T is used, which is an easily calculated function of T and the sample size n. The resulting $T^*$ is then compared with the asymptotic points of the test statistic. Modified forms and tables are given in Biometrika Tables for Statisticians, Vol. II, Table 54, and also in Stephens [2].
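As an illustration of the modified forms, the sketch below applies Case 0 modifications for D, $W^2$ and $A^2$ and compares $T^*$ with upper 5% asymptotic points. The constants are those usually quoted from Stephens' Case 0 tables; they are reproduced here as an assumption and should be checked against [2] or Table 54 of the Biometrika Tables before use. The names `MODIFIED`, `POINTS_5PCT` and `case0_test` are hypothetical.

```python
import math

# Case 0 modified forms T* and upper-tail 5% asymptotic points, as commonly quoted
# from Stephens' tables (assumed here; verify against [2] before relying on them).
MODIFIED = {
    "D": lambda t, n: t * (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)),
    "W2": lambda t, n: (t - 0.4 / n + 0.6 / n ** 2) * (1.0 + 1.0 / n),
    "A2": lambda t, n: t,  # A^2 needs no modification for n >= 5
}
POINTS_5PCT = {"D": 1.358, "W2": 0.461, "A2": 2.492}

def case0_test(name, value, n):
    """Return (T*, reject) for a Case 0 test at the 5% level."""
    t_star = MODIFIED[name](value, n)
    return t_star, t_star > POINTS_5PCT[name]

# D = 0.30 with n = 25: T* = 0.30 * (5 + 0.12 + 0.022) = 1.5426 > 1.358, so H0 is rejected.
print(case0_test("D", 0.30, 25))
```

The point of the modification is visible in the code: only one asymptotic point per statistic and level is stored, and the dependence on n is carried entirely by the simple function producing $T^*$.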

When parameters are estimated efficiently (that is, with asymptotic variances given by the inverse of the Fisher information matrix) and used in the PIT to give the z-values from which the statistics are calculated, asymptotic percentage points can be calculated for statistics of the Cramér-von Mises family. These include $W^2$, $U^2$, and $A^2$. The asymptotic points depend on the distribution being tested, but not on the true values of location or scale parameters in the distribution; however, they do depend on shape parameters such as occur in the Gamma, Weibull or von Mises distributions. The points for finite n would be very difficult to calculate, and would have to be determined by Monte Carlo methods. Fortunately, for these statistics, the finite-n points converge very quickly to the asymptotic points, so that the latter may be used for practical purposes; a test with very small sample size would in any case have very little power.

For Kolmogorov-Smirnov statistics the distribution theory is more difficult. Again, points will not depend on the true values of location or scale parameters, but even asymptotic points are very difficult to calculate. Such tables as exist have usually been found by Monte Carlo methods. In addition, points for finite n do not converge rapidly to the asymptotic points for these statistics, so that it is necessary to give either the finite-n points (obtained by Monte Carlo) or modified forms, as was done for Case 0.

For  both  families  of  statistics,  extensive  tables  of  points  are  given  by  Stephens  [2] 
for  testing  for  the  normal,  exponential,  Gamma,  Weibull,  extreme-value,  von  Mises  and 
Cauchy  distributions,  so  that  the  tests  are  available  for  practical  use. 

2.5 Power

The power of a test statistic will of course depend on several factors, including the size (or α-level) of the test, the sample size, and especially the alternative to the tested distribution. Nevertheless, some general remarks can be made concerning the power of EDF statistics:

1. As two-sided omnibus tests (that is, tests against all alternatives, or at least a wide range of alternatives), the Cramér-von Mises family is in general more powerful than the Kolmogorov-Smirnov family. This might be expected, as the former "tests" the hypothesized distribution all along the range of values of x, while the latter looks for a marked discrepancy between the EDF and the hypothesized F(x), possibly only around one point.

2. For Case 0, there is a difference in power between the statistics, according to whether the alternative distribution is mostly a change in the location of the distribution or a change in the scale. $W^2$ and $A^2$ will detect a change in location, and $U^2$ a change in scale. The Kolmogorov statistic D also detects a change in location.

3. If there is a change in location, and the direction is known, the statistics $D^+$ or $D^-$ can be very powerful; however, if the wrong statistic is used, the power can easily be less than the α-level, that is, the test is biased. $D^+$ detects the situation where the true location is less than that tested, and $D^-$ detects the opposite situation.

4. When parameters are estimated, the differences between the powers for these various types of alternative tend to fade, although the Cramér-von Mises family will still be better overall than the Kolmogorov-Smirnov tests.

5. On the whole, the recommended test statistic is the Anderson-Darling $A^2$; it is particularly effective in detecting outliers, that is, observations which are further into the tails than expected, and this is often the situation which the tester most wishes to detect.
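Point 3 can be checked with a small Monte Carlo experiment. The sketch below tests $H_0$: N(0, 1) (Case 0) when the data actually come from $N(-1, 1)$, so the true location is less than that tested. The critical value uses Smirnov's asymptotic result $P(\sqrt{n}\,D^+ > x) \approx \exp(-2x^2)$ together with the modification factor $\sqrt{n} + 0.12 + 0.11/\sqrt{n}$, which we assume here; the shift, n, number of replications and seed are illustrative choices, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def one_sided_ks(z):
    """D+ and D- from formula (1)."""
    zs = sorted(z)
    n = len(zs)
    d_plus = max(i / n - zs[i - 1] for i in range(1, n + 1))
    d_minus = max(zs[i - 1] - (i - 1) / n for i in range(1, n + 1))
    return d_plus, d_minus

def power_one_sided(shift=-1.0, n=20, reps=2000, alpha=0.05, seed=7):
    """Monte Carlo rejection rates of D+ and D- for H0: N(0, 1) when data are N(shift, 1)."""
    crit = math.sqrt(-math.log(alpha) / 2.0)            # solves exp(-2 x^2) = alpha
    factor = math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)  # assumed Case 0 modification
    null = NormalDist(0.0, 1.0)
    rng = random.Random(seed)
    rejects_plus = rejects_minus = 0
    for _ in range(reps):
        z = [null.cdf(rng.gauss(shift, 1.0)) for _ in range(n)]
        d_plus, d_minus = one_sided_ks(z)
        if d_plus * factor > crit:
            rejects_plus += 1
        if d_minus * factor > crit:
            rejects_minus += 1
    return rejects_plus / reps, rejects_minus / reps

print(power_one_sided())
```

With this set-up $D^+$ should reject in most replications, while $D^-$ should reject far less often than the nominal 5%, which is the bias described in point 3.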

Further  details  on  all  these  statistics  are  given  by  Stephens  [2];  a  discussion  of  their  use, 
and  comparisons  with  other  statistics,  for  the  “observations  random  in  time”  situation 
described  briefly  above,  is  in  Stephens  [3]. 

3  Regression  tests 

3.1  Introduction 

For the second part of this paper, we describe another group of tests, to be called regression tests. They are based on a well-established and popular technique for testing fit to selected distributions, the probability plot. In regression tests, the order statistics $x_{(i)}$ of a sample are plotted on the vertical axis of a graph, against $t_i$, a set of constants which depend only on i, along the horizontal axis. (In the probability plot, the axes were reversed, but for convenience in introducing test statistics we keep them as above.) The constants $t_i$ are chosen so that the relationship between the $x_{(i)}$ and $t_i$ is approximately a straight line. Historically, the linear relationship was often judged by eye, but more recently test statistics have been developed, based on the parameters associated with the straight-line fit, when this is done by ordinary or generalised least squares.

Regression tests arise naturally when unknown parameters in the tested distribution $F(x; \theta)$ are location and scale parameters. Suppose $F(x; \theta)$ is $F_0(w)$, where $F_0(w)$ is a completely specified distribution and $w = (x - \alpha)/\beta$; then $\theta = (\alpha, \beta)$, with α a location parameter and β a scale parameter. A sample with order statistics $x_{(i)}$ can be derived from a set of values w from $F_0(w)$ with order statistics $w_{(i)}$ by the relationship

$$x_{(i)} = \alpha + \beta w_{(i)}, \qquad i = 1, \ldots, n. \tag{2}$$
An obvious example is the test for normality, where the density of w is given by $f(w) = (2\pi)^{-1/2} \exp(-w^2/2)$. Let $\Phi(w) = \int_{-\infty}^{w} f(t)\, dt$; then $F_0(w) = \Phi(w)$ and $F(x; \theta) = \Phi(w)$ with $w = (x - \mu)/\sigma$.

In the more general case, let $m_i = E(w_{(i)})$; then, from (2), we have

$$E(x_{(i)}) = \alpha + \beta m_i, \tag{3}$$

and a plot of $x_{(i)}$ against $m_i$ should be approximately a straight line with intercept α on the vertical axis and slope β. The values $m_i$ are the most natural values to plot along the horizontal axis, but for most distributions they are difficult to calculate. Various authors have therefore proposed alternatives $t_i$ which are convenient functions of i; then (3) can be replaced by the model

$$x_{(i)} = \alpha + \beta t_i + \epsilon_i, \tag{4}$$

where $\epsilon_i$ is an "error" which will have mean zero only when $t_i = m_i$.

It is then important to find a good method of testing how well the data fit the line (3) or (4). One way is simply to measure the correlation coefficient $r(x, t)$ between the paired sets $x_{(i)}$ and $t_i$. A second method is to estimate β using generalised least squares, and to compare this estimate with the estimate of scale given by the sample variance. We now examine these two procedures.


3.2  The  correlation  coefficient  as  test  statistic 

In discussing the correlation coefficient $r(x, t)$, we extend the usual meaning of correlation, and also that of variance and covariance, to apply to constants as well as random variables. Thus let x refer to the vector $x_{(1)}, \ldots, x_{(n)}$, and t to the vector $t_1, \ldots, t_n$; let $\bar x = \sum x_{(i)}/n$ and $\bar t = \sum t_i/n$, and define the sums

$$
\begin{aligned}
S(x, t) &= \sum (x_{(i)} - \bar x)(t_i - \bar t) = \sum x_{(i)} t_i - n \bar x \bar t, \\
S(x, x) &= \sum (x_{(i)} - \bar x)^2 = \sum (x_i - \bar x)^2, \\
S(t, t) &= \sum (t_i - \bar t)^2.
\end{aligned}
$$

$S(x, x)$ will often be called $S^2$.

The correlation coefficient between x and t is

$$r(x, t) = \frac{S(x, t)}{\{S(x, x)\, S(t, t)\}^{1/2}}. \tag{5}$$
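The sums and equation (5) can be computed directly. This minimal sketch (the function name `sums_and_r` is hypothetical) sorts x so that each order statistic $x_{(i)}$ is paired with $t_i$, which is assumed to be supplied in increasing order of i.

```python
import math

def sums_and_r(x, t):
    """Return S(x,t), S(x,x), S(t,t) and r(x,t) of equation (5)."""
    xs = sorted(x)                      # pair x_(i) with t_i
    n = len(xs)
    if n != len(t) or n < 2:
        raise ValueError("x and t must have the same length n >= 2")
    xbar = sum(xs) / n
    tbar = sum(t) / n
    s_xt = sum(xi * ti for xi, ti in zip(xs, t)) - n * xbar * tbar
    s_xx = sum((xi - xbar) ** 2 for xi in xs)
    s_tt = sum((ti - tbar) ** 2 for ti in t)
    return s_xt, s_xx, s_tt, s_xt / math.sqrt(s_xx * s_tt)

# A "perfect" sample lying exactly on the line x = 3 + 2t gives r = 1.
print(sums_and_r([5.0, 1.0, 3.0], [-1.0, 0.0, 1.0]))
```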


Statistics $r(x, m)$ or $r^2(x, m)$ are natural statistics for testing the fit of x to the model (3), since if a "perfect" sample is given, that is, a sample whose ordered values fall exactly at their expected values, $r(x, m)$ will be 1; more generally, the value of $r(x, m)$ can be


Aspects of Goodness-of-Fit


interpreted as a measure of how closely the sample resembles a perfect sample. Tests based on $r(x, m)$, or equivalently on $r^2(x, m)$, will be one-tailed, with rejection of $H_0$ occurring only for low values of r.

However, as $n \to \infty$, $1 - r^2(x, m) \to 0$ on $H_0$, so that $r^2(x, m)$ itself has a degenerate limit. A statistic which does have a non-degenerate asymptotic distribution is

$$Z(x, m) = n\{1 - r^2(x, m)\}. \tag{6}$$

Then $Z(x, m)$ is a statistic equivalent to $r^2$, based on the sum of squares of the residuals after the line (3) has been fitted. In common with many other goodness-of-fit statistics, for example chi-square and the EDF statistics, $Z(x, m)$ has the property that the larger it is, the worse the fit. Sarkadi [4] showed consistency of the test based on $r(x, m)$ for normality, and Gerlach [5] has shown consistency for correlation tests based on $r(x, m)$, or equivalently $Z(x, m)$, for a wide class of distributions including all the usual continuous distributions. This is to be expected since, for large n, we can expect our sample to become perfect in the sense above. We can expect the consistency property to extend to $r(x, t)$ provided that t approaches m sufficiently rapidly for large samples.

3.3  The  correlation  test  for  the  normal  distribution 

For the normal distribution $f(w) = (2\pi)^{-1/2} \exp(-w^2/2)$, with $w = (x - \mu)/\sigma$; thus $\alpha = \mu$ and $\beta = \sigma$, and the $m_i$ are the expected values of standard normal order statistics. Equation (3) becomes

$$E(x_{(i)}) = \mu + \sigma m_i. \tag{7}$$

For the normal distribution $\bar m = 0$, and $r^2(x, m)$ can conveniently be written in vector notation. Let x be the vector $(x_{(1)}, \ldots, x_{(n)})$, and let m be the vector $(m_1, \ldots, m_n)$; let primes, e.g. $x'$ and $m'$, denote transposes of vectors or matrices.

Then

$$r^2(x, m) = \frac{(m'x)^2}{(m'm)\, S^2}. \tag{8}$$


The values of $m_i$ required for the calculation of $r^2(x, m)$ have been well tabulated, and good computer programs are also available.

This statistic will later be seen to be identical to W', the Shapiro-Francia statistic, so that, for testing normality, we shall refer to $r^2(x, m)$ also as W'. Tables for W' have been given by Shapiro and Francia [6].

In practice, it is easier to interpolate in tables of $Z(x, m)$ rather than $r^2(x, m)$, and Stephens [7] has produced tables for $Z(x, m)$ for both complete and censored samples. The null hypothesis that the sample comes from a normal distribution is rejected for large values of $Z(x, m)$.
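For readers without tables of $m_i$ to hand, the following sketch computes $W' = r^2(x, m)$ and $Z(x, m)$ for a normality test. It substitutes Blom's approximation $m_i \approx \Phi^{-1}\{(i - 0.375)/(n + 0.25)\}$ for the exact expected values, which is our own choice and not the method of the text; significance points are not included and must still be taken from tables such as those of Stephens [7]. The function names are hypothetical.

```python
from statistics import NormalDist, fmean

def blom_scores(n):
    """Blom's approximation to the expected normal order statistics m_i."""
    inv = NormalDist().inv_cdf
    return [inv((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def z_normal(x, scores=None):
    """Return (W', Z) with W' = r^2(x, m) as in (8) and Z = n(1 - W') as in (6)."""
    xs = sorted(x)
    n = len(xs)
    m = blom_scores(n) if scores is None else scores
    xbar = fmean(xs)
    s2 = sum((v - xbar) ** 2 for v in xs)          # S^2
    mx = sum(mi * xi for mi, xi in zip(m, xs))     # m'x (valid because the m_i sum to 0)
    mm = sum(mi * mi for mi in m)                  # m'm
    w_prime = mx * mx / (mm * s2)
    return w_prime, n * (1.0 - w_prime)
```

Large values of Z point to non-normality, in line with the rejection rule above; a sample lying exactly on the line (7) gives $W' = 1$ and $Z = 0$.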
De Wet and Venter [8] have proposed the use of the statistic $r(x, H)$, where $H_i = \Phi^{-1}\{i/(n+1)\}$. Use of $H_i$ makes distribution theory easier, and de Wet and Venter have given the asymptotic null distribution of $Z(x, H)$. The $H_i$ must be found numerically, using one of the excellent approximations available for $\Phi^{-1}$. The values of $H_i$ and $m_i$ are close in the middle of the sample, but are wider apart at the extremes. However, in Leslie et al. [9] it is shown that $Z(x, H)$ and $Z(x, m)$ have the same asymptotic null distribution.
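The difference between $H_i$ and $m_i$ can be seen numerically. This sketch again uses Blom's approximation as a stand-in for the exact $m_i$ (an assumption of the illustration), with n = 100 chosen for illustration; the function names are hypothetical.

```python
from statistics import NormalDist

def de_wet_venter_h(n):
    """H_i = Phi^{-1}{i/(n+1)}, i = 1, ..., n."""
    inv = NormalDist().inv_cdf
    return [inv(i / (n + 1)) for i in range(1, n + 1)]

def blom_scores(n):
    """Blom's approximation to m_i (a stand-in for the exact values)."""
    inv = NormalDist().inv_cdf
    return [inv((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

n = 100
h, m = de_wet_venter_h(n), blom_scores(n)
middle_gap = abs(h[n // 2] - m[n // 2])   # i = 51, near the centre of the sample
tail_gap = abs(h[-1] - m[-1])             # i = n, the largest observation
print(f"centre: {middle_gap:.4f}  extreme: {tail_gap:.4f}")
```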


3.4 The Shapiro-Wilk procedure

We next turn to the second method of testing mentioned above, in which the parameters α and β in the model $x_{(i)} = \alpha + \beta m_i + \epsilon_i$ are estimated by generalised least squares. Using our previous notation, let $w_{(i)}$ be the order statistics from $F_0(w)$, that is, with α = 0 and β = 1; let $m_i = E(w_{(i)})$ as before, and let $E\{(w_{(i)} - m_i)(w_{(j)} - m_j)\} = V_{ij}$, the covariance of $w_{(i)}$ and $w_{(j)}$. Then let x be the column vector with components $x_{(1)}, \ldots, x_{(n)}$, let m be a column vector with components $m_1, \ldots, m_n$, and let 1 be a column vector with each component equal to 1. Let V be the matrix with elements $V_{ij}$. The generalised least squares estimates of α and β are then

$$\hat\alpha = -m'Gx \quad \text{and} \quad \hat\beta = 1'Gx, \tag{9}$$

where

$$G = \frac{V^{-1}(1m' - m1')V^{-1}}{(1'V^{-1}1)(m'V^{-1}m) - (1'V^{-1}m)^2}. \tag{10}$$

For  some  distributions,  for  example  the  normal  and  exponential,  these  equations  simplify 
considerably. 
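Equations (9) and (10) can be checked in the exponential case, where m and V are known in closed form: for standard exponential order statistics, $m_i = \sum_{j=1}^{i} (n+1-j)^{-1}$ and $V_{ij} = \sum_{k=1}^{\min(i,j)} (n+1-k)^{-2}$ (the Rényi representation). The NumPy sketch below, with hypothetical function names, builds m and V, forms G as in (10), and returns the estimates (9). For this distribution they reduce to the simple closed forms $\hat\beta = n(\bar x - x_{(1)})/(n-1)$ and $\hat\alpha = x_{(1)} - \hat\beta/n$, which the test checks.

```python
import numpy as np

def exponential_m_v(n):
    """m_i and V_ij for standard exponential order statistics (Renyi representation)."""
    inv = 1.0 / (n - np.arange(n))            # 1/n, 1/(n-1), ..., 1/1
    m = np.cumsum(inv)                         # m_i = sum_{j<=i} 1/(n+1-j)
    c = np.cumsum(inv ** 2)                    # partial sums of 1/(n+1-k)^2
    idx = np.minimum.outer(np.arange(n), np.arange(n))
    return m, c[idx]                           # V_ij = c[min(i, j)]

def gls_location_scale(x, m, v):
    """Generalised least squares estimates (9), with G from (10)."""
    x = np.sort(np.asarray(x, dtype=float))
    one = np.ones_like(m)
    vinv = np.linalg.inv(v)
    denom = (one @ vinv @ one) * (m @ vinv @ m) - (one @ vinv @ m) ** 2
    g = vinv @ (np.outer(one, m) - np.outer(m, one)) @ vinv / denom
    return -m @ g @ x, one @ g @ x             # (alpha_hat, beta_hat)
```

Explicit inversion of V mirrors the notation of (10); for large n a linear solver would be the numerically safer choice.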

A method of testing fit has been proposed by Shapiro and Wilk [10, 11] for testing normality and exponentiality. The procedure used is basically to compare the estimate of $\beta^2$ given by equation (9) with the estimate of $\beta^2$ given by the sample variance; the ratio of these estimates, multiplied by a constant, is taken as the test statistic. In the case of tests for normality, slight modifications of the first estimate of $\beta^2$ have also been suggested, since the estimate is complicated to calculate.

For the Shapiro-Wilk test for normality, α and β in (3) are μ and σ respectively; the estimates of these parameters given by (9) then become

$$\hat\mu = \bar x \quad \text{and} \quad \hat\sigma = \frac{m'V^{-1}x}{m'V^{-1}m}.$$

The test statistic proposed by Shapiro and Wilk [10] is

$$W = \frac{R^4 \hat\sigma^2}{C^2 S^2},$$

where $S^2 = \sum (x_{(i)} - \bar x)^2 = \sum (x_i - \bar x)^2$, $R^2 = m'V^{-1}m$, and $C^2 = m'V^{-1}V^{-1}m$. The factors $R^4$ and $C^2$ ensure that W always takes values between 0 and 1.
Suppose the vector a is defined by $a = V^{-1}m/C$; then

$$W = \frac{(a'x)^2}{S^2} = \frac{\left(\sum a_i x_{(i)}\right)^2}{S^2}.$$

In order to calculate W, the vector a is needed, and this in turn requires values of m and $V^{-1}$, derived from V. For values of n between 21 and 50, Shapiro and Wilk used approximations for the components $a_i$ of a, and gave a table of values of $a_i$ for sample sizes from n = 3 to 50. They also gave Monte Carlo points with which to make the test. The test is one-tailed: small values of W are significant.

A test similar to W, but for use with n > 50, was later suggested by Shapiro and Francia [6]. This is based on the observation of Gupta [12], who noted that the estimate $\hat\sigma$ is almost the same if $V^{-1}$ is ignored in equation (10); the test statistic then given by Shapiro and Francia is

$$W' = \frac{(m'x)^2}{(m'm)\, S^2}.$$

As has already been observed, this is equivalent to the sample correlation statistic $r^2(x, m)$.


3.5 Asymptotic equivalence of the Shapiro-Wilk and correlation statistics

Thus we have the remarkable result that the Shapiro-Wilk statistic for testing normality approaches the squared correlation coefficient $r^2(x, m)$. It is interesting to ask why this is so: why can $V^{-1}$ be "ignored" when calculating W? Stephens [13] has shown heuristically that, for large n, m becomes an eigenvector of V, with $Vm \to \frac{1}{2}m$; then $V^{-1}m \to 2m$, $m'V^{-1}x \to 2m'x$, and $m'V^{-1}m \to 2m'm$. Hence $W \to W'$, because the factor 2 cancels in the numerator and denominator of W. The above results were proved rigorously by Leslie [14]. Stephens [13] also gives other asymptotic eigenvalues and eigenvectors of V.
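The eigenvector property can be illustrated numerically. Exact normal V is not available in closed form, so this sketch estimates m and V by Monte Carlo (an assumption of the illustration, with n = 20, 20000 replications and an arbitrary fixed seed) and then measures how nearly Vm is proportional to m. For moderate n the ratio $m'Vm/m'm$ should be somewhat below 1/2, approaching it as n grows. The function name is hypothetical.

```python
import numpy as np

def normal_m_v_monte_carlo(n, reps=20000, seed=0):
    """Monte Carlo estimates of m_i = E(w_(i)) and V_ij = Cov(w_(i), w_(j)) for N(0, 1)."""
    rng = np.random.default_rng(seed)
    w = np.sort(rng.standard_normal((reps, n)), axis=1)   # each row is an ordered sample
    return w.mean(axis=0), np.cov(w, rowvar=False)

m, v = normal_m_v_monte_carlo(20)
vm = v @ m
rayleigh = (m @ vm) / (m @ m)       # the constant c if Vm were exactly c * m
cosine = (m @ vm) / (np.linalg.norm(m) * np.linalg.norm(vm))
print(f"m'Vm/m'm = {rayleigh:.3f}, cosine(Vm, m) = {cosine:.4f}")
```

A cosine very close to 1 indicates that m is nearly an eigenvector of V, which is what allows $V^{-1}$ to be "ignored".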


4 Power comparisons

Shapiro and Wilk [10] gave power results for W, based on Monte Carlo studies. Unfortunately, the comparisons with EDF statistics were inaccurate: the EDF statistics were compared with Case 0 tables, and not the Case 3 tables to be used when the parameters μ and σ are estimated by $\bar x$ and s. Stephens [15] later gave comparisons based on the correct tables. These show that W is only slightly superior overall to EDF statistics, and in particular only slightly superior to the Anderson-Darling $A^2$. Both statistics tend to have higher power than older statistics such as $\sqrt{b_1}$ and $b_2$, the coefficients of skewness and kurtosis.
5 Two questions

The above results show that W, asymptotically equivalent to the correlation statistic $r^2(x, m)$, appears to give overall the most powerful omnibus test for normality. Two questions can then be asked:


(a) Will the Shapiro-Wilk procedure give good tests for other distributions?

(b) Will the correlation coefficient be successful for testing other distributions?

With reference to the first question, we first observe that, for other distributions tested, the Shapiro-Wilk procedure will not necessarily lead to a statistic which is asymptotically equivalent to the correlation coefficient.

For the exponential distribution, where $F(x; \theta) = 1 - \exp\{-(x - \alpha)/\beta\}$ for $x > \alpha$ (thus $F_0(w) = 1 - \exp(-w)$ and $\theta = (\alpha, \beta)$), the estimates (9) become

$$\hat\beta = \frac{n(\bar x - x_{(1)})}{n - 1} \quad \text{and} \quad \hat\alpha = x_{(1)} - \hat\beta/n.$$

The ratio $\hat\beta^2/S^2$, omitting some factors involving n, leads to the statistic

$$W_E = \frac{n(\bar x - x_{(1)})^2}{(n - 1) S^2}.$$


Although this statistic has been proposed, and points given, for testing exponentiality (Shapiro and Wilk [11], Currie [16]), it has not proved powerful (Stephens [3]). Furthermore, it does not provide a consistent test, which means that there will be some distributions, not exponential, which would not be detected with power approaching 1 when the test for exponentiality is applied to large samples. Sarkadi [4] first pointed this out, by observing that $W_E$ is equivalent, for large samples, to the coefficient of variation (CV) of the sample. To fix ideas, suppose α is known to be zero (this is frequently the case when the exponential distribution is used, although the discussion which follows is easily adapted to the case where α is not zero). Then, for large n, $x_{(1)} \to \alpha = 0$ and, apart from factors involving n, $W_E$ behaves like $\bar x^2/s^2$, where $s^2$ is the sample variance. The squared coefficient of variation is $s^2/\bar x^2$, so that $W_E$ behaves like $1/\mathrm{CV}^2$, and for large exponential samples this is 1. However, many distributions have CV = 1, and a very large sample from one of these will also have a $W_E$ approaching 1. The power of $W_E$ will then approach a constant (less than 1) depending on the variance of $W_E$.
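Sarkadi's argument can be illustrated by simulation. The sketch computes $(\bar x - x_{(1)})^2/s^2$, that is, $W_E$ with its factors involving n removed (our own rescaling, for illustration only), on a large exponential sample and on a large lognormal sample with $\sigma^2 = \log 2$. The latter distribution is not exponential, but its coefficient of variation is exactly 1, since $\mathrm{CV}^2 = e^{\sigma^2} - 1$. The function name `we_core`, the sample size and the seed are hypothetical choices.

```python
import math
import random
from statistics import fmean

def we_core(x):
    """(xbar - x_(1))^2 / s^2: W_E with its factors involving n dropped (illustrative rescaling)."""
    n = len(x)
    xbar = fmean(x)
    s2 = sum((v - xbar) ** 2 for v in x) / (n - 1)   # sample variance
    return (xbar - min(x)) ** 2 / s2

rng = random.Random(3)
n = 100_000
expo = [rng.expovariate(1.0) for _ in range(n)]
# Lognormal with sigma^2 = log 2: not exponential, but CV = 1.
logn = [rng.lognormvariate(0.0, math.sqrt(math.log(2.0))) for _ in range(n)]
print(round(we_core(expo), 3), round(we_core(logn), 3))
```

Both values should be close to 1 (the lognormal one a little lower, because its smallest observation approaches 0 only slowly), so for large n the statistic cannot separate the two distributions.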

Spinelli and Stephens [17] have given power studies with samples taken from some other distributions with CV = 1, where the power of $W_E$ is seen to diminish as the sample size n increases. Lockhart and Stephens [18] have explored the question of non-consistency further, and have shown that only for a very limited family of distributions, including the normal, does the Shapiro-Wilk procedure give a consistent test. Thus this technique of basing a test on the ratio of the regression estimate of scale to that given by the sample standard deviation cannot be recommended except for the normal case.

We now turn to the second question above. Since the correlation coefficient is powerful for testing normality, will it be equally successful for tests on other distributions? First, it should be emphasised that the most appropriate correlation is that between x and m: in the normal case, $H_i = \Phi^{-1}\{i/(n+1)\}$ was "sufficiently close" to $m_i$ that the correlations $r(x, H)$ and $r(x, m)$ were approximately equal for large samples, and so had the same power. This is not so for other distributions. For example, for the exponential distribution, $m_i = \sum_{j=1}^{i} (n + 1 - j)^{-1}$ and $H_i = -\log\{1 - i/(n+1)\}$, and these are not close enough in the tails to give $r(x, H)$ as much power as $r(x, m)$. We can expect this result to be true also for other long-tailed distributions, and the question of when $r(x, H)$ can replace $r(x, m)$ for those distributions for which m is hard to calculate is itself an interesting research topic. See, for example, McLaren and Lockhart [19].
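For the exponential distribution both sets of constants are explicit, so the gap in the tails is easy to see. The sketch below (function name hypothetical) computes $m_i$ and $H_i$ for n = 100.

```python
import math

def exponential_scores(n):
    """Exact m_i = sum_{j=1}^{i} 1/(n+1-j) and H_i = -log{1 - i/(n+1)}."""
    m, acc = [], 0.0
    for j in range(1, n + 1):
        acc += 1.0 / (n + 1 - j)
        m.append(acc)
    h = [-math.log(1.0 - i / (n + 1)) for i in range(1, n + 1)]
    return m, h

m, h = exponential_scores(100)
print(m[49] - h[49], m[-1] - h[-1])   # i = 50 (centre) and i = n (largest)
```

Near the middle of the sample (i = 50) the two agree to about 0.005, whereas at i = n they differ by more than 0.5; this tail discrepancy is what costs $r(x, H)$ power relative to $r(x, m)$.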

We  therefore  confine  further  discussion  to  the  properties  of  r(x,m).  We  have  seen 
that,  in  contrast  to  W,  r(x,m)  always  gives  a  consistent  test.  However,  McLaren  and 
Lockhart  [19]  show  that  the  asymptotic  relative  efficiency  of  correlation  tests  can  be  zero 
compared  with  EDF  tests. 

Stephens [20] adapted $W_E$ to test exponentiality in the case where α is known; this can be compared with most power studies on other statistics, which usually assume α = 0. Stephens [Ste86c] gives some tables for comparison, and these demonstrate that the W statistics are in general less powerful than EDF statistics.


6 Censored data

One attraction of correlation statistics is the fact that the correlation coefficient is well known to most applied statisticians, and the formula is very easy to calculate. This is true also for censored observations of Type I or Type II, where missing observations are all at one end of the sample, often in the right-hand tail where higher values occur. Because of this appeal, Stephens [7], as was stated earlier, gives many tables of $Z(x, m) = n\{1 - r^2(x, m)\}$, or of the corresponding $Z(x, H)$, for use with right-censored data and for testing the exponential, Weibull and other distributions. EDF statistics have also been adapted for censored data, and formulas and tables for these statistics are given by Stephens [2]. For censored data, as for full samples, the statistics $Z(x, m)$ and $Z(x, H)$ may not be as powerful in general as EDF statistics; much depends on the influence of the tail observations which are lost by censoring. Finally, randomly censored data pose a particular problem in testing fit. The Kaplan-Meier estimate of F(x) can be used for EDF statistics, and $r^2(x, m)$ can still be calculated if it is known which ordered observations have been lost, but in either case tables are difficult to provide. More work is needed on this topic.
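For Type II right-censored data, only the k smallest of the n order statistics are observed, and the correlation is computed from those k pairs $(x_{(i)}, m_i)$. The sketch below does this for the exponential case. Multiplying by the full sample size n in Z is our assumption about the normalisation; the exact form, and the significance points, should be taken from Stephens [7]. The function names are hypothetical.

```python
def exponential_m(n):
    """Expected standard exponential order statistics m_i = sum_{j=1}^{i} 1/(n+1-j)."""
    m, acc = [], 0.0
    for j in range(1, n + 1):
        acc += 1.0 / (n + 1 - j)
        m.append(acc)
    return m

def censored_z(x_observed, n, scores):
    """Z = n{1 - r^2} from the k smallest order statistics of a sample of size n (Type II).

    `scores` holds the n plotting constants (here m_i) for the full sample;
    only the first k of them are paired with the observed values.
    """
    xs = sorted(x_observed)
    k = len(xs)
    if not 2 < k <= n or len(scores) != n:
        raise ValueError("need 3 <= k <= n observed values and n scores")
    t = scores[:k]
    xbar, tbar = sum(xs) / k, sum(t) / k
    s_xt = sum((a - xbar) * (b - tbar) for a, b in zip(xs, t))
    s_xx = sum((a - xbar) ** 2 for a in xs)
    s_tt = sum((b - tbar) ** 2 for b in t)
    return n * (1.0 - s_xt * s_xt / (s_xx * s_tt))
```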
7 Tests for discrete distributions


Until now, tests have been discussed only for continuous distributions. The correlation coefficient and the Shapiro-Wilk procedure do not adapt readily to discrete distributions, but EDF tests can be adapted. The technique is based on measures of discrepancy between the cumulative histograms of observed values and expected values (Pearson's $\chi^2$ measures the discrepancy within each cell, and does not sum the observeds and expecteds). Pettitt and Stephens [21] gave some distribution theory for the Kolmogorov-Smirnov statistic for testing uniformity, and Freedman [22] discussed the Watson $U^2$ statistic, one of the Cramér-von Mises family, for the discrete uniform test. Recently, Lockhart and Stephens [23] have extended the test to include $W^2$ and $A^2$, and Spinelli and Stephens [24] develop a test for the Poisson distribution using Cramér-von Mises statistics. Power studies show these tests to be quite effective. In particular, for the test of discrete uniformity, the EDF statistics will be more powerful than Pearson's $\chi^2$ when the alternative is a trend in the cell probabilities, for example, to test that the probability of a defective item produced in a factory is the same each week, against the alternative that it decreases with time.
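To illustrate the point about trends, the sketch below computes a cumulative-discrepancy $W^2$ and $A^2$ alongside Pearson's $X^2$. The particular definitions (cumulative observed minus cumulative expected counts $Z_j$, weighted by the cell probabilities $p_j$) are an assumed form in the spirit of the statistics discussed above, not necessarily the exact ones of [23], and no significance points are given. The function name is hypothetical.

```python
def discrete_edf_stats(observed, probs):
    """Cumulative-discrepancy W2, A2 and Pearson's X2 for a discrete null distribution."""
    k = len(observed)
    n_total = sum(observed)
    if k != len(probs) or n_total == 0:
        raise ValueError("need one probability per cell and at least one observation")
    w2 = a2 = x2 = 0.0
    cum_obs = cum_exp = 0.0
    for j, (o, p) in enumerate(zip(observed, probs)):
        e = n_total * p
        x2 += (o - e) ** 2 / e                # Pearson: cell-by-cell discrepancy
        cum_obs += o
        cum_exp += e
        z = cum_obs - cum_exp                 # cumulative observed minus cumulative expected
        w2 += z * z * p
        if j < k - 1:                         # last cell has H = 1 and z = 0: skip to avoid 0/0
            h = cum_exp / n_total
            a2 += z * z * p / (h * (1.0 - h))
    return {"W2": w2 / n_total, "A2": a2 / n_total, "X2": x2}

p = [1 / 6] * 6
trend = discrete_edf_stats([13, 12, 11, 9, 8, 7], p)        # steadily decreasing counts
alternating = discrete_edf_stats([13, 7, 12, 8, 11, 9], p)  # same deviations, no trend
```

With six equally likely cells and 60 observations, the two patterns have identical cell-by-cell deviations and hence the same $X^2 = 2.8$, but the trend gives a $W^2$ about seven times larger, because its deviations accumulate in the cumulative counts.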


8 Summary and final remarks


In this paper we have reviewed two important methods of testing fit: EDF statistics and regression methods based on the probability plot. Tests based on the EDF and those based on the correlation coefficient $r(x, m)$ are consistent, whereas those derived from the Shapiro-Wilk procedure are not consistent in general; the exception is the test for normality. For large samples, however, the correlation coefficient can have low efficiency compared with EDF tests. For smaller samples, and for censored data, the situation is less clear, and more work is needed. Of course, other techniques for testing fit exist, based on Pearson's $\chi^2$, on spacings, or on the empirical characteristic function. In general, for tests for continuous distributions, Pearson's $\chi^2$ has low power compared with EDF statistics, due to the loss of information resulting from the grouping required. EDF statistics compare well with the other methods also, and, for overall testing against omnibus alternatives, these statistics are recommended. For specified limited alternatives, other tests (for example the Likelihood Ratio test) can clearly have good properties. Stephens [2, 3, 7] discusses these issues, but much more research can be done, both on mathematical aspects of the statistics and on practical comparisons of tests.


